Introduction to Open Data Science - Course Project

Chapter 1: About the project

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.

# This is a so-called "R chunk" where you can write R code.

today <- date()
paste("Today is", today, ". Hello R! Hello world!")
## [1] "Today is Thu Nov 19 14:41:35 2020 . Hello R! Hello world!"

My background and expectations

How I found out about this course: I already came across this course last year (unfortunately I didn't have the time to participate) and was intrigued by the contents and impressed by the fabulous feedback.

My background and expectations: As a graduate student in physics I have grown quite fond of using Python for my data analysis work. Now I am excited to see and compare what is possible with R - I am looking forward to participating in this interesting course! So far, the "warming up" week went fine and the introductions were easy to follow. The syntax shown so far didn't look too complicated and was partly familiar from Python and Python libraries such as pandas and matplotlib. As I am quite used to Python (and it is used more and more in the natural sciences), I don't expect to switch to R after the course, but I would love to learn more about it and gain an understanding of its pros and cons. So if something comes up in the future for which R is actually better suited, I will be able to use it or - if collaborators use it - able to understand it. And furthermore, it's always fun to learn new skills. :)

Practicalities

Link to my GitHub repository : https://github.com/kirstef/IODS-project

Looking forward to next week!


Chapter 2: Regression and model validation

Describe the work you have done this week and summarize your learning.

In this chapter we will explore the dataset from http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt, first taking a closer look at the dataframe and afterwards visualizing some interesting features.

Data wrangling

REMARK: The data wrangling part is done in the .R file in the data folder. Still, I will show a preview of the original dataset in the diary, as it illustrates the development from the original data to the dataset we finally evaluate.

Looking at the original dataset

First let’s read in the full learning2014 data as a dataframe from the given website.

lrn14 <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt", sep="\t", header=TRUE)
head(lrn14)
##   Aa Ab Ac Ad Ae Af ST01 SU02 D03 ST04 SU05 D06 D07 SU08 ST09 SU10 D11 ST12
## 1  3  1  2  1  1  1    4    2   4    4    2   4   4    3    3    2   3    3
## 2  2  2  2  2  1  1    4    2   4    4    4   2   3    4    4    1   4    1
## 3  4  1  1  1  1  1    3    1   4    4    2   3   4    1    3    1   4    4
## 4  4  2  3  2  1  1    3    3   4    4    3   4   4    2    3    1   3    3
## 5  3  2  2  1  2  1    4    2   5    3    4   4   4    3    4    2   4    2
## 6  4  2  1  1  1  1    4    3   5    4    3   5   5    4    4    1   5    3
##   SU13 D14 D15 SU16 ST17 SU18 D19 ST20 SU21 D22 D23 SU24 ST25 SU26 D27 ST28
## 1    3   4   3    2    3    2   4    2    3   3   2    2    4    4   4    4
## 2    3   2   3    4    4    2   3    1    2   2   3    4    2    4   2    2
## 3    2   4   2    3    3    1   4    3    2   4   3    3    4    4   3    5
## 4    2   4   3    2    3    1   3    3    3   3   3    2    3    2   3    3
## 5    3   4   3    3    4    1   4    3    2   3   3    4    4    3   3    5
## 6    1   5   4    2    3    2   4    3    4   5   4    2    4    2   5    4
##   SU29 D30 D31 SU32 Ca Cb Cc Cd Ce Cf Cg Ch Da Db Dc Dd De Df Dg Dh Di Dj Age
## 1    3   4   4    3  2  4  3  4  3  2  3  4  3  4  4  5  4  2  4  3  4  4  53
## 2    3   3   4    5  4  4  4  5  5  3  2  4  4  3  3  4  3  2  3  3  2  4  55
## 3    2   4   3    5  3  5  4  4  3  4  4  2  1  4  4  1  4  1  3  1  1  5  49
## 4    3   4   4    3  3  4  4  4  3  4  4  3  2  4  5  2  5  1  5  4  2  5  53
## 5    3   3   4    4  2  4  4  3  3  3  4  4  3  4  4  4  4  2  5  5  3  3  49
## 6    2   5   5    3  3  5  4  4  3  4  5  4  3  5  4  4  4  3  4  3  3  5  38
##   Attitude Points gender
## 1       37     25      F
## 2       31     12      M
## 3       25     24      F
## 4       35     10      M
## 5       37     22      M
## 6       38     21      F

Looking only at the head of the dataframe already makes it quite visible that some data wrangling will help in understanding the data better. It also shows that metadata is quite valuable, as we can't tell the meaning of the variables just by looking at them.

Data analysis

Preparing an analysis sub dataset for further data exploration

For this we have to load the library dplyr (library(dplyr)), which is a common R library used for data wrangling.

library(dplyr)

1. Looking at the new data subset: Reading in the students2014 data

First let’s use read.csv() or read.table() to read in the data subset created before and saved as a .csv in the project data-subfolder.

students2014 <- read.csv("./data/learning2014.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)

With head(students2014) we can display the first six rows of the subset. It is now much more readable and interpretable than before.

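head(students2014)  # the call producing the preview below (shown here for completeness)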
##   gender age attitude     deep  stra     surf points
## 1      F  53      3.7 3.583333 3.375 2.583333     25
## 2      M  55      3.1 2.916667 2.750 3.166667     12
## 3      F  49      2.5 3.500000 3.625 2.250000     24
## 4      M  53      3.5 3.500000 3.125 2.250000     10
## 5      M  49      3.7 3.666667 3.625 2.833333     22
## 6      F  38      3.8 4.750000 3.625 2.416667     21

Structure and dimension of the subset

With dim() and str() we can further explore the dimension and structure of the dataset.

dim(students2014)
## [1] 183   7
str(students2014)
## 'data.frame':    183 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

dim() already tells us that the subset consists of 7 variables and 183 observations. str() gives further information: the variables are those that we defined before (cf. the R script) and consist of four combined variables of numeric type, which give the score the students obtained in the respective categories on a 1-5 scale, two integer variables giving the age and the points, and one factor variable stating the gender.

Getting rid of 0 values

We can filter our subset further, e.g. to exclude the students who didn't participate in the final exam, and afterwards look at the dimension again.

# select rows where points are greater than zero
students2014 <- filter(students2014, points > 0) 
dim(students2014)
## [1] 166   7

As one can see, we have \(183-166=17\) rows with 0 points, i.e. students who did not take the final exam. Let's keep the filtered dataset for the further exploration, as the reasons for not taking the exam are unknown to us and thus don't necessarily have a strong relation to the attitude or the other variables.

2. Visualizations: Graphical overview of the data

Using ggpairs gives us a good graphical overview of the data and the correlations of the different variables with each other. Using col = gender in the mapping argument also lets us see gender differences (color-coded, female (red), male (blueish)).

library(ggplot2)
library(GGally)
# create a plot matrix with ggpairs()
ggpairs(students2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

Studying the plot in more detail, we see the following parts: scatter plots between the different variables, distributions of the individual variables, and correlation values between the different variables. The higher the absolute value of the correlation, the stronger the relationship between the respective variables. As we can see, the highest correlation seems to be between points and attitude. Looking at the scatter plot between these two variables we can also see that the data points are roughly scattered along a line, which hints at a relation between them as well. The distribution of attitude shows a visible gender difference: whereas the distribution for the female students rises up to about 2.5 and then stays at almost a "plateau" (until 3.5), the distribution for the male students shows only a small fraction of people with a score of 2 or less - after that there is a steep rise to the peak at a bit over 3. The surface question distribution ("surf") shows a maximum at about 3 for the female students, which is higher than for the male students. For age one can see (as expected) that most of the students are between 20 and 30 years old.
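
The strongest pairwise correlation mentioned above can also be computed directly (a quick check):

# correlation between exam points and attitude in the filtered subset
cor(students2014$points, students2014$attitude)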

summary(students2014)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

The summary() command again gives us an overview of the statistics of the different variables. We have about twice as many female students as male students, the mean age is between 25 and 26 years, and the mean score for the question types (deep, strategic, surface) lies between 2.8 and 3.7, with the deep questions scoring highest. The mean attitude lies at about 3.1 and the mean points at about 23. Apart from these mean values, we get further information: the minimum and maximum values, the 1st and 3rd quartiles and the median.

3. Regression model

From the pairs plot above, let's choose points as our target variable and the 3 variables with the highest absolute correlation to points as explanatory variables for the regression model: exam points ~ x1 + x2 + x3. The respective variables are attitude (corr: 0.437), stra (corr: 0.146) and surf (corr: -0.140).
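
A quick way to double-check this choice is to compute the correlations of the numeric variables with points directly (a minimal sketch using only the numeric columns of the subset):

# absolute correlations of the numeric variables with exam points
num_vars <- students2014[, c("age", "attitude", "deep", "stra", "surf")]
sort(abs(cor(num_vars, students2014$points)[, 1]), decreasing = TRUE)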

summary(my_model) gives us further information on the validity and significance of our model; the interpretation follows below.

# create a regression model with multiple explanatory variables
my_model <- lm(points ~ attitude + stra + surf, data = students2014)
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

To find out more about the significance of the relationships we can look at the t-value and Pr(>|t|). The idea of the t-value is that it gets larger (in absolute value) if the null hypothesis is not true. The null hypothesis in this case would be that there is no relation between the variable and the target value. The t-value on its own is difficult to interpret, but Pr(>|t|) then gives us the probability of obtaining such a t-value if the null hypothesis were true. As one can see, this probability is very low for attitude, which is also indicated by the *** next to the value, described as significance codes in the legend. Thus, a relation between points and attitude is quite obvious. Both stra and surf don't have a significant relation to the target variable, so I will remove them from the model.

Just for fun, here is the model once more with attitude and stra:

my_model2 <- lm(points ~ attitude + stra, data = students2014)
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

The t-value for stra is now a bit higher and Pr(>|t|) a bit smaller, but the relation is still not significant (at the 5% level).

4. Reduced regression model: Summary of fitted model

After also removing the stra variable from the model, it is reduced to the following one:

my_model3 <- lm(points ~ attitude, data = students2014)
summary(my_model3)
## 
## Call:
## lm(formula = points ~ attitude, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The values now show a clear significance of the relation between points and attitude. So what does that mean? Apparently, the attitude towards the course influences the final result in the exam (points). That relation is not really surprising: if you like a course and are interested in it, you are probably studying more for it and will more easily get a higher number of points. The model could now be used to predict the points of a person with attitude = x.

Prediction of points for students with different attitudes

new_students <- c("Student X" = 2.5, "Student Y" = 4.8)
new_data <- data.frame(attitude = new_students)

# Predict the new students exam points based on attitude
predict(my_model3, newdata = new_data)
## Student X Student Y 
##  20.45080  28.55936

Plotting the two variables against each other, we can see that the predicted points are right where they should be (which makes sense, because they come from the same model). However, we can see that there is still a lot of scattering of the points above and below the regression line. Still, the tendency of the relationship is quite visible. The multiple R-squared value is a measure of how well our fitted model describes the data and how strong the relation is: the value of about 0.19 means that the model explains about 19% of the variation in the exam points. So the model is not perfect, but it may still be useful. To judge this further, one has to take a look at the residuals.
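
The multiple R-squared can also be extracted directly from the model summary (a small check):

# multiple R-squared of the reduced model
summary(my_model3)$r.squared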

library(ggplot2)
# plotting the two variables against each other with aesthetic mapping
p1 <- ggplot(students2014, aes(x = attitude, y = points, col=gender))

# using points for visualization
p2 <- p1 + geom_point()

# add a regression line and plot
p3 <- p2 + geom_smooth(method = "lm")
p3
## `geom_smooth()` using formula 'y ~ x'

5. Diagnostics plot

The Residuals vs. Fitted Values plot shows that the residuals are scattered quite evenly above and below the zero line without an obvious pattern, which is a good sign for our model (the constant variance assumption seems reasonable).

The Q-Q plot shows that the residuals follow the line well for most of the inner quantiles and only diverge from it in the outer quantiles, so the normality assumption of the errors is reasonably well met.

The Residuals vs. Leverage plot can show us if some points have too much leverage (outliers). In this case there is no outlier visible that completely changes the trend of the fit.

#par(mfrow = c(2,2)) #To put the images in the same figures. However, I prefer them a bit bigger right now.
plot(my_model3, which=c(1,2,5))


Chapter 3: Logistic Regression

Data wrangling

The data wrangling part is performed in the R-script create_alc.R in the data folder and the created dataset is saved as a .csv file (‘alc.csv’) in the data folder.

Data analysis

2. Reading the data and description of the dataset

The dataset we want to look at is a joined dataset built from two datasets obtained from the following website: https://archive.ics.uci.edu/ml/datasets/Student+Performance , where it is also further described. It covers the performance of secondary education students of two Portuguese schools in mathematics and Portuguese. The datasets contain the students' answers to different questions concerning their working and personal life, social background, circumstances etc.

Here, in our analysis, we will in particular look at the alcohol consumption and its relation to other variables.

To start the data analysis part, we include the library dplyr (library(dplyr)) and read in the dataset.

library(dplyr)
alc <- read.csv("./data/alc.csv")

To get an overview of the dataset, let's print out the names of the variables and get the dimensions. The meaning of the variables can be found on the above-mentioned website.

colnames(alc)
##  [1] "X"          "school"     "sex"        "age"        "address"   
##  [6] "famsize"    "Pstatus"    "Medu"       "Fedu"       "Mjob"      
## [11] "Fjob"       "reason"     "nursery"    "internet"   "guardian"  
## [16] "traveltime" "studytime"  "failures"   "schoolsup"  "famsup"    
## [21] "paid"       "activities" "higher"     "romantic"   "famrel"    
## [26] "freetime"   "goout"      "Dalc"       "Walc"       "health"    
## [31] "absences"   "G1"         "G2"         "G3"         "alc_use"   
## [36] "high_use"
dim(alc)
## [1] 382  36

In total, we have 382 observations and 36 variables in the joined alc dataset.

3. Relation of different values to high/low alcohol consumption

Choose 4 interesting variables from the data and formulate hypotheses about their relationship with alcohol consumption

Hypothesis 1 - sex: There is a statistical difference concerning sex: male students are more likely to have a high alcohol consumption than female students.

Hypothesis 2 - failures: The number of past class failures is related to alcohol consumption (more failures leading to higher alcohol consumption).

Hypothesis 3 - famrel: Good family relationships make high alcohol consumption less likely.

Hypothesis 4 - goout: The frequency of going out with friends affects the alcohol consumption.

4. Numerical and graphical exploration of chosen variables

With gather() and ggplot() one can first get an overview bar plot of each variable.

library(tidyr); library(dplyr); library(ggplot2)
# draw a bar plot of each variable
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()

Now we can continue by looking at the values interesting for the hypotheses stated above.

Hypothesis 1: sex and alcohol consumption

alc %>% group_by(sex, high_use) %>% summarise(count=n())
## `summarise()` regrouping output by 'sex' (override with `.groups` argument)
## # A tibble: 4 x 3
## # Groups:   sex [2]
##   sex   high_use count
##   <chr> <lgl>    <int>
## 1 F     FALSE      156
## 2 F     TRUE        42
## 3 M     FALSE      112
## 4 M     TRUE        72

The cross-tabulation supports Hypothesis 1: whereas about 21% of the female students report a high alcohol consumption, about 39% of the male students do, which is almost double. To look at this graphically, let's draw a bar plot.
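
These percentages can be reproduced directly from the data (a quick check using the already loaded dplyr verbs):

# share of students reporting high alcohol use within each sex
alc %>% group_by(sex) %>% summarise(high_use_share = mean(high_use))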

library(ggplot2)
g2 <- ggplot(alc, aes(x = high_use, col=sex))
g2 + geom_bar()

The bar plot shows quite clearly that the proportion of high use differs between male and female students.

Hypothesis 2 - failures: Let's now investigate whether one can see a clear relation between the failures and the alcohol consumption.

alc %>% group_by(high_use) %>% summarise(count=n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
##   high_use count
##   <lgl>    <int>
## 1 FALSE      268
## 2 TRUE       114
alc %>% group_by(failures) %>% summarise(count=n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 2
##   failures count
##      <int> <int>
## 1        0   334
## 2        1    24
## 3        2    19
## 4        3     5
alc %>% group_by(high_use, failures) %>% summarise(count=n())
## `summarise()` regrouping output by 'high_use' (override with `.groups` argument)
## # A tibble: 8 x 3
## # Groups:   high_use [2]
##   high_use failures count
##   <lgl>       <int> <int>
## 1 FALSE           0   244
## 2 FALSE           1    12
## 3 FALSE           2    10
## 4 FALSE           3     2
## 5 TRUE            0    90
## 6 TRUE            1    12
## 7 TRUE            2     9
## 8 TRUE            3     3
alc %>% group_by(high_use, failures>0) %>% summarise(count=n())
## `summarise()` regrouping output by 'high_use' (override with `.groups` argument)
## # A tibble: 4 x 3
## # Groups:   high_use [2]
##   high_use `failures > 0` count
##   <lgl>    <lgl>          <int>
## 1 FALSE    FALSE            244
## 2 FALSE    TRUE              24
## 3 TRUE     FALSE             90
## 4 TRUE     TRUE              24

As one can see, 5 of the 382 students have experienced 3 past class failures. Of these 5, three report high alcohol consumption. This sample might however be too small to judge whether a real relationship exists. Let's create a bar and a box plot to investigate further.

g3 <- ggplot(alc, aes(x=failures, col=high_use))
g3 + geom_bar()

g4 <- ggplot(alc, aes(x = sex, y = alc_use, col=failures>0))
g4 + geom_boxplot()

Looking at the bar plot, one might at least come to the conclusion that having experienced no failure makes high alcohol consumption less likely. However, the sample of students with at least one failure is much smaller than the sample without any, so it makes more sense to combine all students with failures > 0 into one group. The box plot then gives a nice visualisation of how having experienced at least one failure can relate to alcohol consumption. Again, the effect seems to be much stronger for the male students.

Hypothesis 3 - Family:

alc %>% group_by(famrel,high_use) %>% summarise(count=n())
## `summarise()` regrouping output by 'famrel' (override with `.groups` argument)
## # A tibble: 10 x 3
## # Groups:   famrel [5]
##    famrel high_use count
##     <int> <lgl>    <int>
##  1      1 FALSE        6
##  2      1 TRUE         2
##  3      2 FALSE       10
##  4      2 TRUE         9
##  5      3 FALSE       39
##  6      3 TRUE        25
##  7      4 FALSE      135
##  8      4 TRUE        54
##  9      5 FALSE       78
## 10      5 TRUE        24
g5 <- ggplot(alc, aes(x = high_use, y = famrel))
g5 + geom_boxplot() + ylab("quality of family relationships (from 1 - very bad to 5 - excellent)")

g6 <- ggplot(alc, aes(x=famrel, col=high_use))
g6 + geom_bar()

Both the box plot and the bar plot seem to show at least a trend supporting the hypothesis: the distribution of family relationship quality is shifted towards lower values for students reporting high alcohol consumption (box roughly between 3 and 4) compared to students who do not (box roughly between 4 and 5). The bar plot gives a somewhat clearer view of the different family relationship bins: the high-use fraction is larger for famrel = 2 or 3 than for 4 and 5. For famrel = 1 the sample is probably too small to judge.
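
A simple numeric check of this tendency is to compare the mean family relationship quality between the two groups (a small sketch):

# mean quality of family relationships for students with and without high alcohol use
alc %>% group_by(high_use) %>% summarise(mean_famrel = mean(famrel))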

Hypothesis 4 - going out

alc %>% group_by(goout,high_use) %>% summarise(count=n())
## `summarise()` regrouping output by 'goout' (override with `.groups` argument)
## # A tibble: 10 x 3
## # Groups:   goout [5]
##    goout high_use count
##    <int> <lgl>    <int>
##  1     1 FALSE       19
##  2     1 TRUE         3
##  3     2 FALSE       84
##  4     2 TRUE        16
##  5     3 FALSE      103
##  6     3 TRUE        23
##  7     4 FALSE       41
##  8     4 TRUE        40
##  9     5 FALSE       21
## 10     5 TRUE        32
g7 <- ggplot(alc, aes(x = high_use, y = goout, col=sex))
g7 + geom_boxplot() + ylab("going out with friends (from 1 [very low] to 5 [very high]")

g8 <- ggplot(alc, aes(x=goout, col=high_use))
g8 + geom_bar()

Again, the numeric cross-tabulation as well as the plots support the hypothesis: going out with friends more often (higher value) seems to be related to higher alcohol consumption.
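
The same trend can be expressed as the share of high use per goout level (a quick check):

# share of students reporting high alcohol use for each level of going out
alc %>% group_by(goout) %>% summarise(high_use_share = mean(high_use))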

5. Logistic regression

# find the model with glm()
m <- glm(high_use ~ sex + failures + famrel + goout, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ sex + failures + famrel + goout, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6492  -0.7464  -0.5207   0.8642   2.4194  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.3922     0.6415  -3.729 0.000192 ***
## sexM          0.8976     0.2530   3.548 0.000388 ***
## failures      0.3339     0.2034   1.641 0.100696    
## famrel       -0.3971     0.1390  -2.857 0.004281 ** 
## goout         0.7751     0.1217   6.367 1.93e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 391.08  on 377  degrees of freedom
## AIC: 401.08
## 
## Number of Fisher Scoring iterations: 4

The model summary shows a high significance for the coefficients of sexM and goout, a medium significance for famrel and no significant value for failures. One could fit the model again, dropping at least the failures variable. But first let's study the odds ratios (OR).

coef(m)
## (Intercept)        sexM    failures      famrel       goout 
##  -2.3922069   0.8976484   0.3338923  -0.3971296   0.7751449
# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                     OR      2.5 %    97.5 %
## (Intercept) 0.09142769 0.02506625 0.3124735
## sexM        2.45382585 1.50205707 4.0576659
## failures    1.39639281 0.93864152 2.0963627
## famrel      0.67224693 0.51012773 0.8814055
## goout       2.17090677 1.72124671 2.7771652

The odds ratios tell us whether having a certain property (e.g. being male, going out often) is positively or negatively associated with the logical target variable (here, high alcohol consumption). An OR > 1 means a positive association, an OR < 1 a negative one.

The male sex variable has the strongest relationship to the binary target variable, with an odds ratio of about 2.45 - that means the odds of high alcohol consumption are roughly 2.5 times higher for a male student than for a female student. "Going out" is also positively associated with high alcohol consumption, whereas "family relation" has a negative association (a high quality of family relationships making low alcohol consumption more likely). These ORs thus support the above-mentioned hypotheses.
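
To illustrate what the sex odds ratio means in practice, one can compare the predicted probabilities for two otherwise identical students; this is only an illustrative sketch and the chosen attribute values are made up:

# predicted probability of high_use for a (hypothetical) female vs. male student
# with identical values for the other predictors
example <- data.frame(sex = c("F", "M"), failures = 0, famrel = 4, goout = 3)
predict(m, newdata = example, type = "response")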

6. Predictive power of Model

As failures didn't have a significant relationship according to my logistic regression model, I will drop this variable and fit the model again, print out the summary and calculate the OR and confidence intervals:

m2 <- glm(high_use ~ sex + famrel + goout, data = alc, family = "binomial")
summary(m2)
## 
## Call:
## glm(formula = high_use ~ sex + famrel + goout, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5836  -0.7707  -0.5316   0.8198   2.5474  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.3550     0.6414  -3.672 0.000241 ***
## sexM          0.9342     0.2514   3.716 0.000202 ***
## famrel       -0.4118     0.1387  -2.969 0.002989 ** 
## goout         0.7971     0.1214   6.565  5.2e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 393.80  on 378  degrees of freedom
## AIC: 401.8
## 
## Number of Fisher Scoring iterations: 4
coef(m2)
## (Intercept)        sexM      famrel       goout 
##  -2.3549669   0.9341541  -0.4117517   0.7971166
OR <- coef(m2) %>% exp
CI <- confint(m2) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
##                     OR      2.5 %    97.5 %
## (Intercept) 0.09489664 0.02602013 0.3241397
## sexM        2.54505957 1.56349803 4.1969640
## famrel      0.66248876 0.50299888 0.8679153
## goout       2.21913298 1.76086023 2.8372779

As a next step let's check the predictive power of the new model, adding the new columns probability and prediction to the alc dataset. The probabilities are calculated by using the predict() function on the model, and the prediction is defined as TRUE if the probability is higher than 50%.

probabilities <- predict(m2, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probabilities > 0.5)

To see an example of how well the model works, we can print the last ten rows of the columns we are interested in and look at the confusion matrix:

select(alc, sex, famrel, goout,  high_use, probability, prediction) %>% tail(10)
##     sex famrel goout high_use probability prediction
## 373   M      4     2    FALSE  0.18639810      FALSE
## 374   M      5     3     TRUE  0.25195331      FALSE
## 375   F      5     3    FALSE  0.11687357      FALSE
## 376   F      4     3    FALSE  0.16650200      FALSE
## 377   F      5     2    FALSE  0.05627990      FALSE
## 378   F      4     4    FALSE  0.30714360      FALSE
## 379   F      2     2    FALSE  0.17019623      FALSE
## 380   F      1     1    FALSE  0.12243164      FALSE
## 381   M      2     5     TRUE  0.85084788       TRUE
## 382   M      4     1     TRUE  0.09357856      FALSE
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   250   18
##    TRUE     63   51

To graphically visualize the actual values and the predictions, we can draw a plot of ‘high use’ vs. ‘probability’.

g <- ggplot(alc, aes(x = probability, y = high_use, col=prediction))
g + geom_point()

Furthermore, we can do another cross-tab of predictions vs. actual values, printing out the fractions instead and adding margins.

table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.65445026 0.04712042 0.70157068
##    TRUE  0.16492147 0.13350785 0.29842932
##    Sum   0.81937173 0.18062827 1.00000000

Accuracy of the model

The proportion of inaccurately classified individuals (training error) can be calculated from the values of the confusion matrix: (18+63)/(250+18+63+51) = 0.21. We get the same value by defining a loss function as below:

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2120419

The loss function gives us a value of about 0.21 (as also calculated before), meaning that about 21% of the predictions on the training data are wrong. This value is better than the value in the DataCamp example (about 26%), making this model slightly better.

7. Bonus: 10-fold cross-validation

To study the test set performance of the model one can use K-fold cross-validation. Here, we will try out 10-fold cross-validation, making use of the loss function defined above.

library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m2, K=10)
cv$delta[1] 
## [1] 0.2251309

The cv.glm function of the boot library can be used for this, and K = 10 defines the number of subsets the cross-validation is performed on. The delta attribute stores the error value. It gives an error of about 0.23, which is better than the 0.26 error of the DataCamp model. Thus, this model has a slightly better performance.


Chapter 4: Clustering and classification

1-2. First look at the Boston dataset

This chapter focuses on clustering and classification, and the Boston dataset will be the foundation for this. The dataset is already included in the MASS package, so it can be loaded directly.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")

To get an idea of what dataset we are dealing with, let's look at its dimensions (dim(Boston)) and structure (str(Boston)).

## [1] 506  14
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

The dataset is made up of 506 observations and 14 variables, describing housing values in the suburbs of Boston. Two of the variables (chas and rad) are of integer type, the others are numeric. Further information can be found at https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html.

3. Graphical overview of the data and variable summaries

“Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.”

Let's see if a pairs() plot can show us some interesting things:
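
# the plotting call itself is not echoed here; a minimal version would be
pairs(Boston)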

A summary (summary(Boston)) can give us more insight:

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Both the pairs plot and the summary are not easy to interpret at a first glance. Another option is to make use of a correlation plot - that is, to graphically display the correlations between the different variables.

library(corrplot)
library(tidyverse)
cor_matrix <- cor(Boston) %>% round(digits=2)
corrplot(cor_matrix, method="circle", cl.pos="b")

In this plot we can nicely see which variables are correlated the most by looking at the colours/diameters of the circles. For example, the circles with a darker shade of blue have a correlation value close to 1, and one can compare with the pairs plot above and see that those variable pairs show a linear tendency (e.g. nox and indus, denoting "nitrogen oxides concentration" and "proportion of non-retail business acres per town"). Then there are those variable pairs whose relationship is visualised by a very small and nearly white circle; those are the ones with no significant relationship (e.g. rm and black - "average number of rooms per dwelling" vs. a measure related to the black population in the town), which is also visible in the pairs plot.
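
Individual values can also be read directly from the rounded correlation matrix computed above, e.g. for the first pair mentioned:

# correlation between nitrogen oxides concentration (nox) and non-retail business acres (indus)
cor_matrix["nox", "indus"]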

4. Scaling the data, crime rate, preparing train and test sets

“Standardize the dataset and print out summaries of the scaled data. How did the variables change? Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset. Divide the dataset to train and test sets, so that 80% of the data belongs to the train set.”

Scaling the data

boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
boston_scaled <- as.data.frame(boston_scaled) # save data in a dataframe
head(boston_scaled)
##         crim         zn      indus       chas        nox        rm        age
## 1 -0.4193669  0.2845483 -1.2866362 -0.2723291 -0.1440749 0.4132629 -0.1198948
## 2 -0.4169267 -0.4872402 -0.5927944 -0.2723291 -0.7395304 0.1940824  0.3668034
## 3 -0.4169290 -0.4872402 -0.5927944 -0.2723291 -0.7395304 1.2814456 -0.2655490
## 4 -0.4163384 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.0152978 -0.8090878
## 5 -0.4120741 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.2273620 -0.5106743
## 6 -0.4166314 -0.4872402 -1.3055857 -0.2723291 -0.8344581 0.2068916 -0.3508100
##        dis        rad        tax    ptratio     black      lstat       medv
## 1 0.140075 -0.9818712 -0.6659492 -1.4575580 0.4406159 -1.0744990  0.1595278
## 2 0.556609 -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.4919525 -0.1014239
## 3 0.556609 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324  1.3229375
## 4 1.076671 -0.7521778 -1.1050216  0.1129203 0.4157514 -1.3601708  1.1815886
## 5 1.076671 -0.7521778 -1.1050216  0.1129203 0.4406159 -1.0254866  1.4860323
## 6 1.076671 -0.7521778 -1.1050216  0.1129203 0.4101651 -1.0422909  0.6705582

One can see that, as a result of the scaling, each variable is now distributed around a mean value of 0 (with unit standard deviation), with values running from negative to positive.

Categorical variable for the crime rate

The "per capita crime rate by town" can be found in the dataset as the continuous variable "crim". This continuous variable should now be changed into a categorical one, using the quantiles as break points. Let's look at the quantiles first with summary(boston_scaled$crim) and then define our bins:

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
bins <- quantile(boston_scaled$crim)

The new factor variable crime will hold the crime data, categorized by “low”, “med_low”, “med_high” and “high”.

crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label=c("low","med_low", "med_high", "high"))
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127

Instead of the old crim variable, let’s now add the new crime variable to the boston_scaled dataframe and drop the old variable crim.

boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)

Divide the data set into training and test sets

To divide the dataset into training and test sets we first need to know the number of rows in the set, which we can get using nrow(). A common partition is to use 80% of the data for training and 20% for testing, which we will do here. With sample() we can randomly choose the corresponding share of the row indices and store them in a variable. Then we can build our train and test sets from the randomly chosen rows.

n <- nrow(boston_scaled)
ind <- sample(n,  size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]

A table overview of the crime categories in the train and test sets gives us a first view of the distribution of the different groups.

traincrime <- table(train$crime)
testcrime <- table(test$crime)
traincrime
## 
##      low  med_low med_high     high 
##       96      101      106      101
testcrime
## 
##      low  med_low med_high     high 
##       31       25       20       26

5. Linear Discriminant Analysis

“Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot.”

LDA Fit on the crime training set

Linear discriminant analysis can be used as a classification method to find a linear combination of the variables that separates the target variable classes. It is fit with the function lda(), which takes as input a formula (e.g. target ~ x1 + x2 + ...) and the dataset. target ~ . means that all other variables in the dataset are used as predictors.

lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2376238 0.2500000 0.2623762 0.2500000 
## 
## Group means:
##                  zn      indus          chas        nox          rm        age
## low       1.0623663 -0.9530923 -0.1082832245 -0.9140409  0.43326255 -0.9120005
## med_low  -0.1268175 -0.2567265  0.0005392655 -0.5337910 -0.13116552 -0.3323438
## med_high -0.3974409  0.2591751  0.2476653031  0.4350980  0.06801733  0.4432166
## high     -0.4872402  1.0171306 -0.1553854963  1.0438456 -0.41609739  0.7962740
##                 dis        rad        tax     ptratio       black       lstat
## low       0.9143239 -0.7055213 -0.7038365 -0.46879175  0.38010683 -0.77477987
## med_low   0.2987374 -0.5588715 -0.4984037 -0.06360983  0.30983088 -0.13432203
## med_high -0.4220659 -0.4130550 -0.2862109 -0.28057075  0.07222323  0.06002397
## high     -0.8507829  1.6379981  1.5139626  0.78062517 -0.99354674  0.89440985
##                 medv
## low       0.51448553
## med_low   0.02011671
## med_high  0.11931825
## high     -0.74239193
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.09005019  0.773823099 -0.95020469
## indus    0.07196941 -0.382637056  0.34905191
## chas    -0.09759560 -0.069370429  0.03244141
## nox      0.32937560 -0.676267514 -1.27470626
## rm      -0.09165636 -0.099909226 -0.22664229
## age      0.23553763 -0.244514716 -0.22477031
## dis     -0.08472400 -0.288215456  0.21414595
## rad      3.20257477  0.775245940  0.23587861
## tax     -0.06500391  0.205394034  0.16685569
## ptratio  0.10693344 -0.032065637 -0.20201168
## black   -0.14814502  0.009311216  0.08143502
## lstat    0.24155840 -0.324408102  0.32090068
## medv     0.19165234 -0.473199331 -0.12916764
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9414 0.0462 0.0124

The printout shows the result of the LDA fit on the train set, including the three estimated discriminant functions LD1, LD2 and LD3, with the highest proportion of trace being 0.94 for LD1. We get three discriminants because we have four different groups ("low", "med_low", "med_high", "high") and the number of discriminants is (at most) one less than the number of groups.
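
The proportion of trace quoted above can also be recomputed from the singular values stored in the fitted object (a quick check):

# proportion of between-group variance captured by each linear discriminant
lda.fit$svd^2 / sum(lda.fit$svd^2)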

LDA (bi)plot

For the LDA biplot, a function has to be created to show the LDA arrows for the different variables in the plot. The code for this function was taken from the DataCamp exercise, which refers to a Stack Overflow message thread. Then the classes are stored in a numeric vector and plotted, together with the arrows, in an LDA (bi)plot.

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2,col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 1)

From the plot we can see that the discriminant LD1 was hugely influenced by the variable rad. Looking at the definition of rad ("index of accessibility to radial highways"), it seems that the accessibility to radial highways plays an important role as a predictor for being placed into the high crime category.

6. Predictions on the test data

“Save the crime categories from the test set and then remove the categorical crime variable from the test dataset. Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results.”

Before going to the next step, let's save the crime categories from the test set in correct_classes and then remove crime from the test dataset.

correct_classes <- test$crime
test <- dplyr::select(test, -crime)

Now the lda.fit model can be used to predict the crime values of the before-created test dataset:

lda.pred <- predict(lda.fit, newdata = test)

A cross-tabulation with the original crime values (stored in correct_classes) can give a nice overview on how well the fitting worked.

table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       11      17        3    0
##   med_low    4      19        2    0
##   med_high   0      11        8    1
##   high       0       0        0   26

Looking at the cross-tabulation one can see that for this data set the high category is predicted perfectly and the med_low classification is quite good, but for the categories low and med_high the model could clearly be improved.

7. Distances, K-Means and visualization of the clusters

Reload the Boston dataset and scale it to standardize the data, before comparing the distances.

# load MASS and Boston
library(MASS)
data('Boston')

boston_scaled2 <- scale(Boston)

Distances

Without specifying the method argument of the dist() function, the Euclidean distance is calculated. The method can be changed, e.g. to method = "manhattan", which will then give different results.

dist_eu <- dist(boston_scaled2)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
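
For comparison, a minimal sketch of the Manhattan variant mentioned above:

# Manhattan distances between the scaled observations
dist_man <- dist(boston_scaled2, method = "manhattan")
summary(dist_man)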

K-Means clustering

Let’s do K-Means clustering on the dataset, starting with 4 as a first try for the number of cluster centers. The results can be looked at with pairs.

km <-kmeans(Boston, centers = 4)
pairs(Boston, col = km$cluster)

As it is difficult to read anything in the total pairs plot, let’s divide it into 4 parts:

pairs(Boston[1:4], col = km$cluster)

pairs(Boston[5:8], col = km$cluster)

pairs(Boston[9:11], col = km$cluster)

pairs(Boston[12:14], col = km$cluster)

I will stick to the [5:8] part of the dataset and investigate how different numbers of clusters (2, 3, 5) change the result compared to the initial choice of 4 clusters.

km <-kmeans(Boston, centers = 2)
pairs(Boston[5:8], col = km$cluster)

km <-kmeans(Boston, centers = 3)
pairs(Boston[5:8], col = km$cluster)

km <-kmeans(Boston, centers = 5)
pairs(Boston[5:8], col = km$cluster)

It seems difficult to decide which number of clusters is best. One can use the total within-cluster sum of squares (WCSS) to help with the decision: the optimal number of clusters can be seen as the point where the total WCSS drops radically. Let's investigate the behaviour of the total WCSS from 1 cluster to 10 clusters.

set.seed(123) # use a certain seed for the initial cluster centers
k_max <- 10 # set the maximum number of clusters

twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss}) #calculate the WCSS

qplot(x = 1:k_max, y = twcss, geom = 'line')

The total WCSS falls steeply until ~2 clusters, afterwards a bit less, and between 3 and 4 it rises and falls again. So we shouldn't use more than 3 clusters, and 2 clusters is probably the optimal value.

Let’s rerun the K-means clustering with 2 for the whole dataset.

km <-kmeans(Boston, centers = 2)
pairs(Boston, col = km$cluster)

(more chapters to be added similarly as we proceed with the course!)